import pandas as pd
import numpy as np
from plotly import express as px
import plotly.io as pio
pio.renderers.default = "iframe"
url = "https://raw.githubusercontent.com/pic16b-ucla/24W/main/datasets/palmer_penguins.csv"
Who doesn’t love penguins? In this plotly visualization tutorial, we’ll be examining the “palmer_penguins” data set, graciously collected and published by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER (you can read more about the project and dataset here).
It contains data on 344 penguins from the Anvers region, belonging to three species (Adelie, Chinstrap, and Gentoo), along with various characteristics such as their home island, the length and depth of their culmen, their flipper length, body mass, sex, and nitrogen and carbon isotope measurements taken from their blood.
Let’s first import all the libraries we need. In this tutorial, we’ll construct a simple visualization using plotly express. We’ll also import plotly.io and set its default renderer so our figure can be displayed on this webpage, along with pandas and numpy to help with our initial data wrangling. Then, we’ll read the dataset into a dataframe called penguins.
penguins = pd.read_csv(url)
penguins
| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
339 | PAL0910 | 120 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N38A2 | No | 12/1/09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
340 | PAL0910 | 121 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N39A1 | Yes | 11/22/09 | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE | 8.41151 | -26.13832 | NaN |
341 | PAL0910 | 122 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N39A2 | Yes | 11/22/09 | 50.4 | 15.7 | 222.0 | 5750.0 | MALE | 8.30166 | -26.04117 | NaN |
342 | PAL0910 | 123 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N43A1 | Yes | 11/22/09 | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE | 8.24246 | -26.11969 | NaN |
343 | PAL0910 | 124 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N43A2 | Yes | 11/22/09 | 49.9 | 16.1 | 213.0 | 5400.0 | MALE | 8.36390 | -26.15531 | NaN |
344 rows × 17 columns
Data Wrangling
For this simple visualization, our goal is to distinguish the three species of penguins: Adelie, Gentoo, and Chinstrap. We’ll use a 2D graph, so we only need two features: the 'Flipper Length (mm)' and 'Culmen Length (mm)' columns.
Note that some of the entries have NaN (missing) values for flipper length and culmen length. We can treat these in a variety of ways, but for this example, we’ll simply remove them. As seen below, only 2 entries have missing values compared to 342 without, so dropping them has little effect on the data.
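As a quick check before dropping anything, the rows with missing measurements can be counted directly. The sketch below is only illustrative: `sample` is a hypothetical stand-in holding rows 0 to 4 of the two measurement columns from the table above, and on the real data the same lines would be applied to `penguins`.

```python
import numpy as np
import pandas as pd

# Rows 0-4 of the two measurement columns, copied from the table above.
# `sample` is a hypothetical stand-in for the full `penguins` dataframe.
sample = pd.DataFrame({
    "Flipper Length (mm)": [181.0, 186.0, 195.0, np.nan, 193.0],
    "Culmen Length (mm)": [39.1, 39.5, 40.3, np.nan, 36.7],
})

cols = ["Flipper Length (mm)", "Culmen Length (mm)"]

# True for every row that has a NaN in at least one of the two columns
has_missing = sample[cols].isna().any(axis=1)

n_missing = int(has_missing.sum())
n_complete = int((~has_missing).sum())
print(n_missing, n_complete)  # 1 4 for these five rows
```

Running the same count on the full dataframe is one way to confirm how many rows `dropna()` will remove.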
Finally, for ease of reading, we’ll drop the word “penguin” and the scientific name from the 'Species' column, keeping only the common species name.
# getting just these 3 columns
penguins = penguins[['Species', 'Flipper Length (mm)', 'Culmen Length (mm)']]
# dropping NaN values
penguins = penguins.dropna()
# getting just the species name
penguins["Species"] = penguins["Species"].str.split().str.get(0)
penguins
| | Species | Flipper Length (mm) | Culmen Length (mm) |
---|---|---|---|
0 | Adelie | 181.0 | 39.1 |
1 | Adelie | 186.0 | 39.5 |
2 | Adelie | 195.0 | 40.3 |
4 | Adelie | 193.0 | 36.7 |
5 | Adelie | 190.0 | 39.3 |
... | ... | ... | ... |
338 | Gentoo | 214.0 | 47.2 |
340 | Gentoo | 215.0 | 46.8 |
341 | Gentoo | 222.0 | 50.4 |
342 | Gentoo | 212.0 | 45.2 |
343 | Gentoo | 213.0 | 49.9 |
342 rows × 3 columns
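To see what the species-name step does in isolation, here is a minimal sketch on two of the raw 'Species' strings that appear in the original table. The `species` series is a hypothetical example, not part of the tutorial's dataframe.

```python
import pandas as pd

# Two raw values of the 'Species' column as they appear in the original table
species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Gentoo penguin (Pygoscelis papua)",
])

# .str.split() breaks each string on whitespace into a list of words;
# .str.get(0) then keeps only the first word of each list
short = species.str.split().str.get(0)
print(short.tolist())  # ['Adelie', 'Gentoo']
```

This works here because the common name is always the first word of the entry; a dataset with multi-word common names would need a different rule.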
Visualization with Plotly
Excellent! Now to visualize, we’ll use plotly’s scatter plot. To do this, we’ll create a figure using px.scatter. It takes in a variety of parameters, including:
- Our penguins dataframe
- What columns to plot on the x and y axes
- color: colors the points based on their species
- Width and height of the plot
We’ll also create some marginal histograms, which display the distribution of the data for one variable only. On the top we’ll see the distribution of flipper lengths across species, and on the right we’ll see the distribution of culmen lengths across species.
Finally, we’ll use the fig.update_layout function to adjust some of our plot aesthetics by adding a title, adjusting the margins, and setting the template style.
fig = px.scatter(data_frame = penguins,
                 x = "Flipper Length (mm)",
                 y = "Culmen Length (mm)",
                 color = "Species",
                 width = 800,
                 height = 500,
                 marginal_y = "histogram",
                 marginal_x = "histogram",
                 )
# Adjust the margins, add in a title, and set a plot template
fig.update_layout(margin={"r":0,"t":40,"l":0,"b":0},
                  title_text="Culmen Length vs. Flipper Length of the Three Anvers Penguin Species",
                  template="ggplot2")
# Show the plot
fig.show()
Discussion
From this plot, we can see some rough distinctions between the three species based on just these two features alone. Because of plotly’s nice interactive features, you can hover over any individual point and see the penguin’s species and its individual measurements.
From the marginal histograms, we can see even more clearly that flipper length separates Gentoo penguins fairly well, as they have noticeably longer flippers on average, while culmen length separates Adelie penguins fairly well, as they have noticeably shorter culmens. Hover over any bin and plotly will display the count of penguins within that range.
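One way to put numbers behind this visual impression is to compare the per-species means of the two features. The sketch below is only illustrative: `sample` is a hypothetical stand-in containing the ten rows displayed in the wrangled table, so it covers only Adelie and Gentoo and is far too small to be representative. On the real data the same groupby would be run on `penguins`.

```python
import pandas as pd

# Hypothetical stand-in for `penguins`: the ten rows shown in the wrangled table
sample = pd.DataFrame({
    "Species": ["Adelie"] * 5 + ["Gentoo"] * 5,
    "Flipper Length (mm)": [181.0, 186.0, 195.0, 193.0, 190.0,
                            214.0, 215.0, 222.0, 212.0, 213.0],
    "Culmen Length (mm)": [39.1, 39.5, 40.3, 36.7, 39.3,
                           47.2, 46.8, 50.4, 45.2, 49.9],
})

cols = ["Flipper Length (mm)", "Culmen Length (mm)"]

# Mean of each measurement within each species
means = sample.groupby("Species")[cols].mean()
print(means.round(2))
```

For these ten rows, the Gentoo means come out higher than the Adelie means on both features, which matches the direction of the differences seen in the histograms.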
This insight could be useful for training models to predict a penguin’s species from its phenotype, tracking the evolution of these species over time, or assessing other general trends in the species’ populations.